Investigating the Gapminder World Dataset: On Sustainable Development¶

Table of Contents¶

  1. Introduction
  2. Data Wrangling
  3. Exploratory Data Analysis
  4. Conclusions

Introduction¶

The Gapminder World dataset is about people from around the world. It describes their lives using data from trustworthy sources, such as the UN. The raison d'être of this dataset is to reduce the gap between popular misconception and reality.

Certain topics are rife with misconception. One such topic is human progress and development. People, including those in developing countries, underestimate the developing world, especially with regard to sustainable development.[1][2] Likewise, people may overestimate the sustainability of developed countries. For instance, Northern Europeans overestimate the role of clean energy in the world today, which is ironic given their proclivity for burning wood and oil for warmth.[3][4] Misconceptions like these could lead to trouble. If Brits were to assume that the clean energy revolution had already occurred, then they might take their foot off the gas, so to speak, on thwarting climate change.

These sorts of misconceptions can be addressed using the Sustainable Development Index (SDI), which includes 164 countries going back three decades. The SDI divides each country's Human Development Index by its "Ecological Impact Index," a metric for excessive pollution or overconsumption of raw materials.

$$ \large{SDI = \dfrac{HDI}{EII}} $$

Thus, the SDI penalizes countries for using up more than their equitable share of resources. But because the baseline $EII$ is 1, the SDI does not reward countries that use less than their allotted resources, either.[5] For these countries, SDI functions just like HDI.
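
As a minimal sketch of the formula above, the snippet below computes the SDI for two made-up countries. The function name `sdi_score`, the input values, and the choice to model the baseline by flooring $EII$ at 1 are assumptions for illustration only; they are not taken from the SDI dataset or its official methodology.

```python
def sdi_score(hdi, eii):
    """Return HDI / EII, treating 1 as the lowest possible EII (assumption)."""
    return hdi / max(eii, 1.0)


# Hypothetical values, chosen only to illustrate the two cases.
within_share = sdi_score(hdi=0.70, eii=1.0)   # no penalty: SDI equals HDI
overshooting = sdi_score(hdi=0.90, eii=3.0)   # penalized for overshoot

print(within_share, overshooting)
```

Under these assumptions, the more developed country ends up with the lower score, which is the pattern the rest of the investigation looks for in the real data.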


This investigation uses ver. 2 of the SDI data, which includes three tables of supporting data. Two are used in the investigation:

  • Gross National Income per capita (GNIpc)
    • Development and per capita productivity are strongly and positively correlated, as are productivity and income.[6]
  • Material Footprint per capita (MFpc)
    • The MFpc of a country is the sum of all raw materials (bio, petro, mineral) that go into producing the goods consumed by its people. This includes materials extracted in other countries.
    • MFpc is strongly, positively correlated with overall environmental impact.[5]
  • Population (not used)

This investigation will set out to determine just how developed a country can get before it starts using up more than its share of resources.

Preliminary Questions¶

  • What are the top countries when it comes to sustainable development?
  • How do they stack up against the most developed countries?
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go

Data Wrangling¶

General Properties¶

Sustainable Development Index (SDI)¶

In [2]:
# The `read_csv()` function includes a number of parameters for selectively
# "reading in" CSV files, which is not only easier on the computer but can give
# the programmer a head start on wrangling the data.
sdi = pd.read_csv('_Sustainable Development Index-Dataset - v2 - Unpivot-SDI'
                  '-for-countries-etc.csv', header=3,
                  usecols=['name', 'Year', 'SDI(%)'])[['name', 'Year', 'SDI(%)']]
sdi.head()
Out[2]:
name Year SDI(%)
0 Afghanistan 1990.0 32.5
1 Afghanistan 1991.0 33.1
2 Afghanistan 1992.0 34.0
3 Afghanistan 1993.0 33.5
4 Afghanistan 1994.0 33.0
In [3]:
sdi.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5790 entries, 0 to 5789
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    4611 non-null   object 
 1   Year    4611 non-null   float64
 2   SDI(%)  4611 non-null   float64
dtypes: float64(2), object(1)
memory usage: 135.8+ KB
  • There are 5790 rows, but each column has only 4611 non-null values.

  • The year data is of the wrong type, float64.*

*Float64 is normally used to represent fractional values, hence the `.0`'s at the end. pandas defaults to it here because a plain integer column cannot hold missing values.
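
The effect can be reproduced on a tiny, made-up CSV. This is only a sketch: the three rows and the `demo` name are hypothetical, and the cleanup shown is one possible way to restore integer years.

```python
import io

import pandas as pd

# A made-up CSV in which one year is missing.
csv_text = "name,Year\nA,1990\nB,\nC,1992\n"

demo = pd.read_csv(io.StringIO(csv_text))
dtype_before = demo['Year'].dtype      # float64, because of the missing value

# Dropping the incomplete row makes a lossless integer conversion possible.
demo = demo.dropna().astype({'Year': int})
dtype_after = demo['Year'].dtype

print(dtype_before, dtype_after)
```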

Gross National Income per capita (GNIpc)¶

In [4]:
gnipc = pd.read_csv('_Sustainable Development Index-Dataset - v2 -'
                    ' data gdpcapcppp@fasttrack year countries_etc.csv',
                    index_col='name')
gnipc.head()
Out[4]:
geo time Income per person
name
Afghanistan afg 1800 603
Afghanistan afg 1801 603
Afghanistan afg 1802 603
Afghanistan afg 1803 603
Afghanistan afg 1804 603
In [5]:
gnipc.info()
<class 'pandas.core.frame.DataFrame'>
Index: 46995 entries, Afghanistan to Zimbabwe
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   geo                46995 non-null  object
 1   time               46995 non-null  int64 
 2   Income per person  46995 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.4+ MB
  • The geo-id column seems redundant.

  • The year data is labeled, somewhat confusingly, as 'time.'

  • The 'Income per person' column does not specify the unit of measure.

Material Footprint per capita (MFpc)¶

In [6]:
mfpc = pd.read_csv('_Sustainable Development Index-Dataset - v2 -'
                   ' data_matfootp_cap#v1@fasttrack_year_countries_etc.csv',
                   index_col='name')
mfpc.head()
Out[6]:
geo time Material footprint per capita (tonnes)
name
Afghanistan afg 1990 2.46
Afghanistan afg 1991 2.81
Afghanistan afg 1992 2.06
Afghanistan afg 1993 1.87
Afghanistan afg 1994 1.60
In [7]:
mfpc.info()
<class 'pandas.core.frame.DataFrame'>
Index: 4816 entries, Afghanistan to Zimbabwe
Data columns (total 3 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   geo                                     4816 non-null   object 
 1   time                                    4816 non-null   int64  
 2   Material footprint per capita (tonnes)  4816 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 150.5+ KB
  • The year data is labeled 'time' here as well.

  • Geo still seems redundant.

  • 'Material footprint per capita (tonnes)' is much too long a name.

Trimming the Data¶

Sustainable Development Index (SDI)¶

  • Drop null values.

  • Convert the year data to integers.

  • Set the name column as the index.

In [8]:
sdi.dropna(axis=0, inplace=True)
sdi['Year'] = sdi.Year.astype(int)
sdi.set_index('name', inplace=True)
sdi.head()
Out[8]:
Year SDI(%)
name
Afghanistan 1990 32.5
Afghanistan 1991 33.1
Afghanistan 1992 34.0
Afghanistan 1993 33.5
Afghanistan 1994 33.0
In [9]:
sdi.isnull().any()
Out[9]:
Year      False
SDI(%)    False
dtype: bool

Gross National Income per capita (GNIpc)¶

  • Drop the redundant geo-id column.

  • Rename mislabeled columns:

    • 'time' as 'Year'
    • 'Income per person' as 'GNIpc (PPP$2017)'
In [10]:
gnipc.drop(columns='geo', inplace=True)
gnipc.columns = ['Year', 'GNIpc (PPP$2017)']
gnipc.head()
Out[10]:
Year GNIpc (PPP$2017)
name
Afghanistan 1800 603
Afghanistan 1801 603
Afghanistan 1802 603
Afghanistan 1803 603
Afghanistan 1804 603

Material Footprint per capita (MFpc)¶

  • Drop the redundant geo-id column.

  • Rename mislabeled columns:

    • 'time' as 'Year'
    • 'Material footprint per capita (tonnes)' as 'MFpc (tons)'
In [11]:
mfpc.drop(columns='geo', inplace=True)
mfpc.columns = ['Year', 'MFpc (tons)']
mfpc.head()
Out[11]:
Year MFpc (tons)
name
Afghanistan 1990 2.46
Afghanistan 1991 2.81
Afghanistan 1992 2.06
Afghanistan 1993 1.87
Afghanistan 1994 1.60

Exploratory Data Analysis¶

Sustainable Developers: Top vs Bottom¶

In [12]:
sdi_latest = sdi.query('Year == 2019')

sdi_latest.sort_values(by='SDI(%)', ascending=False).head()
Out[12]:
Year SDI(%)
name
Costa Rica 2019 85.3
Sri Lanka 2019 84.3
Georgia 2019 83.9
Armenia 2019 82.7
Albania 2019 82.6

The top countries are Costa Rica, Sri Lanka, Georgia, Armenia and Albania.

In [13]:
sdi_latest.sort_values(by='SDI(%)', ascending=False).tail()
Out[13]:
Year SDI(%)
name
United States 2019 18.1
Australia 2019 15.0
United Arab Emirates 2019 11.0
Kuwait 2019 9.9
Singapore 2019 7.9

When it comes to sustainable development, Australia and the United States rank near the bottom, along with other highly developed, highly industrialized countries.

Do highly developed countries tend to struggle with sustainable development?¶

In [14]:
# As highly developed countries tend strongly to be high income, the GNIpc data
# can be used to suss out the highly developed countries.
gnipc_latest = gnipc.query('Year == 2019')

sdi_latest = pd.merge(sdi_latest, gnipc_latest, how='left', on='name')
sdi_latest.head()
Out[14]:
Year_x SDI(%) Year_y GNIpc (PPP$2017)
name
Afghanistan 2019 55.1 2019 1763
Angola 2019 62.6 2019 5544
Albania 2019 82.6 2019 12694
United Arab Emirates 2019 11.0 2019 65650
Argentina 2019 77.7 2019 17529
In [15]:
sdi_latest.drop(['Year_x', 'Year_y'], axis=1, inplace=True) 
In [16]:
# There are more countries than can fit on one comparison chart. One solution
# is to systematically select a sample to represent the rest.
sample = sdi_latest.index[1::5]
sample
Out[16]:
Index(['Angola', 'Antigua and Barbuda', 'Belgium', 'Bahrain', 'Brazil',
       'Central African Republic', 'Cote d'Ivoire', 'Cape Verde', 'Germany',
       'Ecuador', 'Finland', 'Georgia', 'Guatemala', 'Indonesia', 'Iceland',
       'Japan', 'South Korea', 'Libya', 'Morocco', 'Macedonia, FYR',
       'Mozambique', 'Namibia', 'Norway', 'Panama', 'Portugal', 'Rwanda',
       'El Salvador', 'Slovenia', 'Chad', 'Trinidad and Tobago', 'Ukraine',
       'Vietnam', 'Zambia'],
      dtype='object', name='name')
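
The slice `[1::5]` used above is a simple form of systematic sampling: start at position 1, then take every fifth label. A minimal sketch on a hypothetical list of labels (not the real country index):

```python
# Twelve placeholder labels standing in for an index of countries.
labels = [f"country_{i:02d}" for i in range(12)]

# Start at position 1 and step by 5, i.e. positions 1, 6 and 11.
picked = labels[1::5]

print(picked)
```

Because the selection depends only on position, the sample is reproducible, but it also inherits whatever ordering the index happens to have.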
In [17]:
sample = pd.Series(sample)

sdi_latest_sample = pd.merge(sdi_latest, sample, how='right', on='name')
sdi_latest_sample.head()
Out[17]:
name SDI(%) GNIpc (PPP$2017)
0 Angola 62.6 5544
1 Antigua and Barbuda 62.2 24463
2 Belgium 42.9 43517
3 Bahrain 48.8 41966
4 Brazil 75.4 14307
In [18]:
sdi_latest_sample.set_index('name', inplace=True)
In [19]:
sns.set(rc={"figure.figsize":(10, 7)})
sns.set_style('white')

fig = plt.figure()

# Arranging the countries by GNIpc allows for easier comparisons between highly
# developed countries on the one hand and everyone else on the other.
data = sdi_latest_sample.sort_values('GNIpc (PPP$2017)')

ax = sns.barplot(data=data, x=data.index, y='SDI(%)', palette="mako_r")

fig.suptitle('Sustainable Development', y=1.02, fontsize=20, fontweight='bold')
ax.set_title('A Comparison of Countries   ', fontsize=18)
ax.set_ylabel('Sustainable Development Index (%)', fontsize=12)
ax.set_xlabel('Country')
ax.set_xticks(np.arange(len(data)))
ax.set_xticklabels(data.index, rotation=45, ha='right')

ax2 = ax.twiny()

ax2.set_xlabel('GNIpc (%ile)')
ax2.set_xticks(np.arange(5), ['min', '25', '50', '75', 'max'])

mean_sdi = data['SDI(%)'].mean()
plt.axhline(mean_sdi, color='blueviolet', linestyle=':', label='mean SDI')
plt.legend(loc=(.79, .89));

It seems that highly developed countries do tend to struggle with sustainable development. This is exemplified by Norway, which has the highest GNIpc and the lowest SDI in the sample. Most of the other highly developed countries do not fare much better. Those that do are nonetheless well below the mean.

In [20]:
# Incorporating quartiles into the data itself allows for further, programmatic
# analysis.
quartiles = ['bottom', 'lower-middle', 'upper-middle', 'top']

data['GNIpc quartiles'] = pd.qcut(data['GNIpc (PPP$2017)'], 4, labels=quartiles)
In [21]:
# The bottom of the SDI barrel
sdi_q1 = data['SDI(%)'].quantile(0.25)

data.query('`SDI(%)` <= @sdi_q1').sort_values('SDI(%)', ascending=False)
Out[21]:
SDI(%) GNIpc (PPP$2017) GNIpc quartiles
name
Central African Republic 42.8 794 bottom
Chad 42.8 1743 bottom
Slovenia 42.5 33676 upper-middle
Germany 39.3 46173 top
Japan 30.5 39739 top
South Korea 27.4 37343 top
Iceland 23.2 47861 top
Finland 22.3 42383 top
Norway 19.7 66308 top

Most of the countries in the top GNIpc quartile are at the bottom of the Sustainable Development Index. They are joined there by Slovenia, a borderline highly developed country.

The bottom quartile of the SDI is rounded out by Chad and the Central African Republic, which sit on the lower hinge itself (25ᵗʰ percentile). Even the least developed countries hold up better than the most developed ones in terms of SDI!

On the other hand, the countries in the middle two GNIpc quartiles (the midspread) are at the top of the SDI.
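
For readers unfamiliar with `pd.qcut`, the sketch below shows how it assigns quartile labels like the ones used above. The eight income values are made up for illustration; only the labels mirror those in the notebook.

```python
import pandas as pd

# Hypothetical per capita incomes, already sorted for readability.
incomes = pd.Series([800, 1700, 5500, 8500, 12000, 14300, 33700, 66300])

labels = ['bottom', 'lower-middle', 'upper-middle', 'top']

# qcut splits the values into four equal-sized bins based on their quantiles.
quartile = pd.qcut(incomes, 4, labels=labels)

print(quartile.astype(str).tolist())
```

With eight distinct values, each label is assigned to exactly two of them, in ascending order of income.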

Just how developed are the top SDI countries?¶

In [22]:
# The cream of the SDI crop—sorted by GNIpc
sdi_q3 = data['SDI(%)'].quantile(0.75)

data.query('`SDI(%)` > @sdi_q3')
Out[22]:
SDI(%) GNIpc (PPP$2017) GNIpc quartiles
name
Ukraine 75.8 8480 lower-middle
Ecuador 78.3 10215 lower-middle
Georgia 83.9 10671 lower-middle
Indonesia 77.1 12061 lower-middle
Brazil 75.4 14307 upper-middle
Libya 75.7 14751 upper-middle
Panama 82.1 23315 upper-middle
Trinidad and Tobago 74.2 28522 upper-middle

At the top of the SDI, Panama and Trinidad & Tobago stand out for their relative affluence. But they are not on the same level as the most developed countries.

In [23]:
data['GNIpc (PPP$2017)'].describe()
Out[23]:
count       33.000000
mean     19829.090909
std      17411.165500
min        794.000000
25%       6970.000000
50%      12061.000000
75%      33676.000000
max      66308.000000
Name: GNIpc (PPP$2017), dtype: float64

In fact, Panama's GNIpc is only modestly above the mean, despite being almost twice the median. A mean that far above the median indicates a right-skewed distribution.

In [24]:
data['GNIpc (PPP$2017)'].skew()
Out[24]:
0.9013158753711067
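
As a rough cross-check, the gap between mean and median can be turned into a skewness figure directly. The sketch below applies Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, to the summary statistics printed by `describe()` above. It is a different estimator from the moment-based one that pandas' `.skew()` reports, so the two numbers are not expected to match; only the sign and rough size are comparable.

```python
# Summary statistics copied (rounded) from the describe() output above.
mean_gnipc = 19829.09
median_gnipc = 12061.0
std_gnipc = 17411.17

# Pearson's second skewness coefficient: positive values mean right skew.
pearson_skew = 3 * (mean_gnipc - median_gnipc) / std_gnipc

print(round(pearson_skew, 2))  # roughly 1.34
```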
In [25]:
# Plotly interactive plots are great for exploring data, so they will be used
# from here on. But their default title layout could be better.
pd.options.plotting.backend = "plotly"

def reconfig_title():
    """Bold and center the title of the module-level Plotly figure `fig`."""
    title = '<b>' + fig.layout.title.text + '</b>'
    fig.update_layout(title={'x':0.5,
                             'xanchor':'center',
                             'y':0.85,
                             'font_size':18,
                             'text':title,
                             })
In [26]:
# Since histograms have no problem fitting large amounts of data—the more the
# better—the full list of countries can be used.
fig = sdi_latest.plot.hist(x='GNIpc (PPP$2017)', histnorm='percent',
                           marginal='box', hover_name=sdi_latest.index,
                           title='International Distribution of Income per'
                                 ' Person')

reconfig_title()

fig.show()

The three leftmost bins contain more than three quarters of all countries, including high-SDI standouts Panama and Trinidad & Tobago.

The top quartile stretches from 29k to 113k international dollars (66k sans outliers). Per the country comparison above, these are the highly developed countries that struggle with sustainable development. To increase their SDIs, and stave off resource depletion, these countries need to slow down their rate of consumption.

How can a 29k country and a 113k country both be on the same tier of development?

Human development, as a function of GNIpc, is asymptotic. In other words, there are limits to human development that no amount of money can surpass. For example, literacy rate cannot exceed 100%. Similarly, life expectancy everywhere is limited by senescence (aging).

However, it may not be bearable to people in, say, Iceland to slow down their rate of consumption to the level of people in Panama, or even people in Bahrain, for that matter. Icelanders are simply accustomed to a more luxurious standard of living than Panamanians and Bahrainis, even though Bahrain is a highly developed country. But perhaps there is a country as rich as Iceland yet more sustainable*, which can serve as a model for them.

*Or less unsustainable, at least.

As Rich as, yet More Sustainable¶

As a measure of sustainability, MFpc will suffice. In addition to being a measure of consumption, MFpc is indicative of overall environmental impact in most cases.

In [27]:
# The MFpc data only goes up to 2017.
mfpc_latest = mfpc.query('Year == 2017')['MFpc (tons)']

sdi_latest = pd.merge(sdi_latest, mfpc_latest, how='left', on='name')
sdi_latest.head()
Out[27]:
SDI(%) GNIpc (PPP$2017) MFpc (tons)
name
Afghanistan 55.1 1763 1.20
Angola 62.6 5544 3.34
Albania 82.6 12694 11.57
United Arab Emirates 11.0 65650 49.11
Argentina 77.7 17529 14.78

Which rich countries stand out, sustainability-wise, relative to their level of income?¶

In [28]:
# Note: A low MFpc suggests high sustainability.
x = sdi_latest['GNIpc (PPP$2017)']
y = sdi_latest['MFpc (tons)']

# A trendline can help visually separate the relatively lower MFpc countries
# all along the income scale—the better to find outliers with.
fig = sdi_latest.plot.scatter(y=y, x=x, trendline='ols', title='Material Foot'
                              'print vs Income', hover_name=sdi_latest.index)

reconfig_title()

# Likewise, a vertical line at the 3rd quartile can cordon off the high income
# countries. The area that they demarcate can be highlighted to further draw
# attention to it.
z = np.polyfit(x, y, 1)
f = np.poly1d(z)
gnipc_q3 = x.quantile(0.75)
gnipc_max = x.max()

fig.add_trace(go.Scatter(x=[gnipc_q3,
                            gnipc_q3,
                            gnipc_max,
                            gnipc_max,
                            gnipc_q3,
                            ],
                         y=[0,
                            f(gnipc_q3),
                            f(gnipc_max),
                            0,
                            0,
                            ],
                         mode='lines', fill='toself', fillcolor='green',
                         opacity=.2, hoverinfo='skip', showlegend=False))

fig.add_trace(go.Scatter(x=[gnipc_q3, gnipc_q3, gnipc_q3], y=[-3, 83, -3], 
                         mode='lines', hoverinfo='skip', name='Q3'))

The following high-income countries stand out in terms of low material footprint, at least in relation to their income: Oman, Bahrain, Saudi Arabia, Brunei, Ireland, and Qatar.
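
One way to make "stands out" less dependent on eyeballing the chart is to look at residuals: how far each country's footprint sits below the fitted line. The sketch below does this on six made-up (income, footprint) pairs. The data, the variable names, and the rule "top income quartile and below the line" are assumptions for illustration, not the procedure used to pick the countries listed above.

```python
import numpy as np

# Hypothetical (income, footprint) pairs, not taken from the dataset.
income = np.array([1_000, 5_000, 20_000, 40_000, 60_000, 80_000], dtype=float)
footprint = np.array([2.0, 4.0, 12.0, 30.0, 15.0, 40.0])

# Fit a straight line and measure each point's distance from it.
slope, intercept = np.polyfit(income, footprint, 1)
residuals = footprint - (slope * income + intercept)

# Keep only high-income points that fall below the trendline.
high_income = income >= np.quantile(income, 0.75)
stand_outs = np.where(high_income & (residuals < 0))[0]

print(stand_outs)
```

On real data, rows with a missing footprint would need to be dropped before calling `np.polyfit`, since it does not skip NaN values.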

In [29]:
sdi_latest.loc[['Oman', 'Bahrain', 'Saudi Arabia', 'Brunei', 'Ireland', 'Qatar']]
Out[29]:
SDI(%) GNIpc (PPP$2017) MFpc (tons)
name
Oman 63.1 35758 10.34
Bahrain 48.8 41966 14.37
Saudi Arabia 45.6 48115 12.33
Brunei 33.8 72376 20.18
Ireland 42.4 72413 21.50
Qatar 26.0 113331 12.82
In [30]:
sdi_latest.mean()
Out[30]:
SDI(%)                 57.280982
GNIpc (PPP$2017)    19105.883436
MFpc (tons)            13.227055
dtype: float64

Yet most of these countries have low SDIs:

  • Bahrain, Saudi Arabia and Qatar are of average MFpc, but their SDIs are considerably below average (especially Qatar's).
  • Brunei is almost identical to Ireland in terms of both GNIpc and MFpc, but its SDI is considerably lower.
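
The first bullet above can be checked with a few lines of plain Python. The values are copied from the two tables above; the 15% tolerance for "about average" and the function name are assumptions made for this sketch.

```python
# Values copied from the Out[29] and Out[30] tables above.
countries = {
    'Oman': {'sdi': 63.1, 'mfpc': 10.34},
    'Bahrain': {'sdi': 48.8, 'mfpc': 14.37},
    'Saudi Arabia': {'sdi': 45.6, 'mfpc': 12.33},
    'Brunei': {'sdi': 33.8, 'mfpc': 20.18},
    'Ireland': {'sdi': 42.4, 'mfpc': 21.50},
    'Qatar': {'sdi': 26.0, 'mfpc': 12.82},
}
MEAN_SDI = 57.28
MEAN_MFPC = 13.23
TOLERANCE = 0.15  # assumption: "about average" means within 15% of the mean


def average_footprint_but_low_sdi(rows):
    """Return countries with a near-average MFpc and a below-average SDI."""
    flagged = []
    for name, row in rows.items():
        near_average = abs(row['mfpc'] - MEAN_MFPC) <= TOLERANCE * MEAN_MFPC
        if near_average and row['sdi'] < MEAN_SDI:
            flagged.append(name)
    return flagged


flagged = average_footprint_but_low_sdi(countries)
print(flagged)
```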

Their low SDIs must be on account of unsustainability; it is certainly not for lack of development! Likewise, their unsustainability is most likely due to high carbon-dioxide (CO₂) emissions. The SDI accounts for both MFpc and per capita CO₂. Since material footprint is not to blame, carbon footprint is the likely culprit.

That makes sense. Bahrain, Saudi Arabia, Qatar, and Brunei are all petrostates. The petroleum industry is the poster child of carbon-dioxide emissions, but because petrostates export most of their petroleum products, their material footprints are largely unaffected. The exports are counted as part of the material footprints of the importing countries.

And then There were Two

Of the remaining two, Oman and Ireland, one is a crude-oil petrostate with a deceptively small carbon footprint.* The other is a corporate tax haven with an inflated GNI.

*Oman largely punts the refining process to other countries, which is where most of the CO₂ emissions occur.

Conclusions¶

Summary of Findings¶

  • Highly developed countries tend to struggle with sustainable development.
  • Underdeveloped countries struggle with sustainability too, making even less efficient use of resources—albeit at a much smaller scale.
  • The countries in between are the best at sustainable development. They live relatively efficiently and within their means.
  • Human development, as a function of GNIpc, is asymptotic.
    • Thus, a country can be top-tier in terms of development, without being top-tier in terms of income per capita.
  • No rich countries stand out in terms of sustainability, not even relative to their level of per capita income.*

*At least not without offloading the environmental costs or performing accounting sleight of hand (see insert above).

Concluding Remarks¶

The standard for success must not be one of luxury but rather one of socioeconomic agency and financial security. These, along with health and education, will be the lodestones of human progress towards a sustainable future.

Next Steps for Future Investigations¶

  • Incorporate CO₂ data to better understand overall environmental impact.
  • Use HDI instead of GNIpc to directly measure human development and more precisely delineate the highly developed countries.
  • Graph the relationships between HDI and MFpc, CO₂pc, and GNIpc to better understand the costs and benefits of various high levels of development.